Make my own bioinformatics parser with Automa.jl

💻 Problems to solve Parse UCSC chain format, and return a IntervalCollections with the query intervals as the values of the ref intervals.

plain

# UCSC chain format: a demo
chain 4900 chrY 58368225 + 25985406 25985566 chr5 151006098 - 43549808 43549970 2
16      0       2
60      4       0
10      0       4
70

plain

Design

⇾The chain format for each record could be briefly part into three:

  1. The header line of chain score tName tSize tStrand tStart tEnd qName qSize qStrand qEnd id;

  2. The alignment body of size dt dq;

  3. The alignment last line of size, and a blank line.

⇾The parser thus could be devided into four part:

  1. The header parsing step: add actions for each cell in header line, return the whole header var

  2. The alignment parsing step: add actions for each alignment body, push each alignment record to alignment var

  3. The alignment end parsing step: detect the alignment end, return the alignment var

  4. The record parsing step: detect the end of the record, return the whole record var, maybe convert during this step

Tip To speed up the file IO and parsing step (or to simplify the coding), we may not do string convertion during parsing, instead, we will convert the UInt8 IOstream to the format we required after the whole parsing step is done.
Dependencies

Implementation

Define Structure

The final storage structure: ChainMap <: IntervalCollection

The chain format needs to be parsed as a set of IntervalCollection ([ chr:start-end:strand:value, ... ]), with the ability to pair each target interval to its query interval matched. In a similar implementation (CrossMap) in python, the query regions was passed as the values of the target regions (as Strings) :How CrossMap.py store chain maps. In julia, I think we could do it more effectively, by nesting the IntervalTree of query as the value of the IntervaTree of target.

julia

using GenomicFeatures
using Formatting

julia> demo_array = [
    Interval(
        "tchr1",
        j,
        j+100,
        '+',
        Interval("qchr1", j+5, j+105, '+', sprintf1("QID%02d", i))
    ) for (i, j) in enumerate(1:10)
]

julia> typeof(demo_array)
Vector{Interval{Interval{String}}} (alias for Array{Interval{Interval{String}}, 1}) 

julia> demo_map = IntervalCollection(demo_array);
julia> typeof(demo_map)
IntervalCollection{Interval{String}}

julia

The intermediate structure: ChainRecord

Like those in Bed.jl, we would define a structure that stores all IOStreams related to one chain record, and the positions of each item in the IOStream, and define a function to convert it to the ChainMap format.

julia

mutable struct Record
    # ChainRecord, data and filled range
    data   ::Vector{UInt8}
    filled ::UnitRange{Int}
    # number of align subintervals
    naln   ::Int
    # indexes for header line
    score  ::UnitRange{Int}
    tname  ::UnitRange{Int}
    tsize  ::UnitRange{Int}
    tstrand::UnitRange{Int}
    tstart ::UnitRange{Int}
    tend   ::UnitRange{Int}
    qname  ::UnitRange{Int}
    qsize  ::UnitRange{Int}
    qstrand::UnitRange{Int}
    qstart ::UnitRange{Int}
    qend   ::UnitRange{Int}
    id     ::UnitRange{Int}
    # indexes for aligns
    alndata::Vector{UnitRange{Int}}
    alnend ::UnitRange{Int}
end

julia
Todo TO BE CONTINUE...

:x CrossMapUtilL331L334

Find the source here

python

if target_strand == '+':
    maps[source_name].add_interval( Interval(sfrom, sfrom+size,(target_name,
    tfrom, tfrom+size,target_strand)))
elif  target_strand == '-':
    maps[source_name].add_interval( Interval(sfrom, sfrom+size,(target_name,
    target_size - (tfrom+size), target_size - tfrom, target_strand)))

python